Abstract. Astronomers have come to rely on the increasing performance of 
computers to reduce, analyze, simulate and visualize their data. In this envi- 
ronment, faster computation can mean more science outcomes or the opening 
up of new parameter spaces for investigation. If we are to avoid major issues 
when implementing codes on advanced architectures, it is important that we 
have a solid understanding of our algorithms. A recent addition to the high- 
performance computing scene that highlights this point is the graphics processing 
unit (GPU). The hardware originally designed for speeding up graphics rendering 
in video games is now achieving speed-ups of O(100x) in general-purpose 
computation - performance that cannot be ignored. We are using a generalized 
approach, based on the analysis of astronomy algorithms, to identify the opti- 
mal problem-types and techniques for taking advantage of both current GPU 
hardware and future developments in computing architectures. 



1. Introduction 

Modern astronomy has come to rely heavily on high-performance computing 
(HPC). However, all research areas are facing significant challenges as data volumes 
approach petabyte levels. For instance, the Australian Square Kilometre 
Array Pathfinder project will produce data at a rate that makes storage in 
raw form impractical, necessitating on-the-fly reduction and analysis to produce 
4GB/s of products. On the modeling front, there is an ongoing desire for larger 
and more-detailed simulations (e.g., the Aquarius simulation by Springel et al. 
2008). 

The HPC scene has recently witnessed the bold introduction of the graphics 
processing unit (GPU) as a viable and powerful general-purpose co-processor to 
CPUs. GPUs were developed to off-load the computations involved in rendering 
3D graphics from the CPU, primarily to benefit video-games. Their continued 
development has been driven by the $60 billion/year video-game industry. The 
result of this development can be seen in Figure 1, which plots the clock-rate 
of a number of CPUs and GPUs against their core-count. GPUs appear toward 
the top of the plot, exhibiting very high core-counts and performance. Since 
2005 (an area on the plot we refer to as the "multi-core corner"), clock-rates 
in CPUs have plateaued, and manufacturers have instead turned to increasing 
the number of cores per chip [1]. One might therefore consider that CPUs are 
now becoming progressively more GPU-like, and that current GPUs provide a 
picture of future commodity computing architectures. 

[1] This has to do with the difficulty in dissipating the heat produced at higher clock-rates. 

Figure 1. Clock-rate versus core-count phase space of Moore's Law. Black 
dots represent central processing units (CPUs); white diamonds represent 
graphics processing units (GPUs). Diagonal lines are contours of equal 
processing power. 

While CPUs are becoming more GPU-like, the reverse can also be said, with 
GPUs offering increasingly flexible computing platforms. This flexibility, com- 
bined with the availability of general-purpose programming tools, has opened 
up GPU computation to a wide range of non-graphics-related tasks, notably in 
the area of HPC. 

Both the immediate performance boost provided by GPUs and the expected 
future of CPU computing provide strong motivation for a thorough analysis of 
the performance and scalability of our astrophysics algorithms in advanced par- 
allel processing environments. If harnessed correctly, the power of massively- 
parallel architectures like GPUs could lead to significant speed-ups in computa- 
tional astronomy and ultimately to new science outcomes. 

A number of astronomy algorithms have been implemented on GPUs to 
date, including direct N-body simulations (e.g., Belleman, Bedorf & Portegies 
Zwart 2008), the solution of Kepler's equation (Ford 2008), radio astronomy 
correlation (e.g., Harris, Haines & Staveley-Smith 2008), phase-space studies of 
binary black hole inspirals (Herrmann et al. 2009) and gravitational lensing ray- 
shooting (Thompson et al. 2010). These projects have all reported speed-ups 
of O(100) over CPU code [2]. However, these algorithms are for the most part 
"embarrassingly parallel" "low-hanging fruit", meaning that they can be run 
on a parallel processing architecture with little or no overhead. This makes them 
obvious candidates for efficient GPU implementation. The question that remains 
is: exactly which classes of astronomy algorithms are likely to obtain significant 
speed-ups by executing on advanced, massively-parallel architectures? 



[2] It is understood that these speed-up measurements are unlikely to have been obtained in a 
consistent manner; we merely emphasize their order of magnitude. 
2. Our Approach 

We propose a generalized approach based around two key ideas: 

1. Developing and applying an algorithm analysis methodology relevant to 
new hardware architectures; and 

2. Building and using a taxonomy of astronomy algorithms. 

We believe that such an approach will minimise the effort required to turn the 
"multi-core corner" for computational astronomy and ensure that the solutions 
found will continue to scale with future advances in technology. 

Here we briefly outline a number of "rules of thumb" that may be applied 
when analyzing astronomy algorithms with respect to their potential on GPUs. 
Further details will be presented in Barsdell, Fluke & Barnes (in prep.). 

Massive Parallelism: Given the large number of processing cores avail- 
able in GPUs, it is critical that an algorithm be divisible into many fine-grained 
parallel elements in order to fully utilize the hardware (e.g., an NVIDIA GT200-class 
GPU may be under-utilized with fewer than O(10^4) threads). Partitioning data, 
rather than tasks, between parallel threads generally offers a large and scalable 
quantity of parallelism. This is referred to as the "data-parallel" approach. 
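
As a minimal sketch of the data-parallel approach, the direct N-body acceleration calculation below assigns one independent work item to each particle. It is plain Python for illustration only: the function names, the softening length `eps`, the choice G = 1 and the use of `map` as a stand-in for GPU threads are all assumptions made here, not part of the analysis above.

```python
import math

def acceleration_on(i, positions, masses, eps=1e-3):
    """Softened gravitational acceleration on particle i from all particles (G = 1)."""
    xi, yi, zi = positions[i]
    ax = ay = az = 0.0
    for j, (xj, yj, zj) in enumerate(positions):
        dx, dy, dz = xj - xi, yj - yi, zj - zi
        # With eps > 0 the j == i term contributes exactly zero,
        # so no special case is needed for self-interaction.
        r2 = dx * dx + dy * dy + dz * dz + eps * eps
        inv_r3 = masses[j] / (r2 * math.sqrt(r2))
        ax += dx * inv_r3
        ay += dy * inv_r3
        az += dz * inv_r3
    return (ax, ay, az)

def all_accelerations(positions, masses, eps=1e-3):
    """Partition the work by data: one independent work item per particle."""
    # On a GPU each index i would be handled by its own thread;
    # here map() simply stands in for that parallel launch.
    return list(map(lambda i: acceleration_on(i, positions, masses, eps),
                    range(len(positions))))

# Two equal masses on the x-axis attract each other with equal and opposite accelerations.
acc = all_accelerations([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0])
```

Because each work item reads shared inputs and writes only its own output, the amount of available parallelism grows with the number of particles rather than with the number of distinct tasks.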

Memory Access Locality and Patterns: GPU architectures contain 
very high bandwidth main memory, O(100GB/s), which is necessary to "feed" 
the large number of parallel processing units. However, high latency (i.e., mem- 
ory transfer startup) costs mean that performance depends strongly on the pat- 
tern in which memory is accessed. In general, maintaining "locality of refer- 
ence" (i.e., neighboring threads accessing similar locations in memory) is vital 
to achieving good performance. 
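
The effect of access patterns can be illustrated with a toy model that counts how many memory segments each group of neighboring threads touches. This is only a sketch: the group size of 32 threads and the 128-byte segment size are assumed parameters of the model, not figures taken from the text, and real hardware rules are more detailed.

```python
def count_transactions(addresses, group_size=32, segment_bytes=128):
    """Count the memory segments touched by each group of neighboring threads."""
    total = 0
    for start in range(0, len(addresses), group_size):
        group = addresses[start:start + group_size]
        # Each distinct segment touched by the group costs one transaction.
        total += len({addr // segment_bytes for addr in group})
    return total

n_threads = 64
word = 4  # bytes per 32-bit value

# Thread t reads element t: neighboring threads hit neighboring addresses.
contiguous = [t * word for t in range(n_threads)]
# Thread t reads element 32*t: neighboring threads are 128 bytes apart.
strided = [t * 32 * word for t in range(n_threads)]

print(count_transactions(contiguous), count_transactions(strided))
```

In this model the contiguous pattern needs one transaction per group while the strided pattern needs one per thread, which is the sense in which locality of reference determines effective bandwidth.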

Branching: GPUs contain "single instruction multiple data" (SIMD) hard- 
ware. This means that neighboring threads that wish to execute different in- 
structions must wait for each other to complete the divergent code section before 
execution can continue in parallel. For this reason, sections of GPU code that 
are conditionally executed by only a subset of threads should be minimized. 
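
A simple cost model makes the penalty for divergent branches concrete. The sketch below assumes that a SIMD group executing an if/else pays for every branch taken by at least one of its threads; the function names, the group size and the cycle counts are hypothetical values chosen for illustration.

```python
def simd_group_cost(takes_branch, cost_if, cost_else):
    """Cost for one SIMD group executing an if/else under serialized divergence."""
    cost = 0
    if any(takes_branch):        # at least one thread needs the 'if' side
        cost += cost_if
    if not all(takes_branch):    # at least one thread needs the 'else' side
        cost += cost_else
    return cost

def total_cost(predicates, group_size, cost_if, cost_else):
    """Sum the cost over consecutive groups of neighboring threads."""
    return sum(simd_group_cost(predicates[s:s + group_size], cost_if, cost_else)
               for s in range(0, len(predicates), group_size))

# 64 threads in groups of 32; both branches cost 10 units (assumed values).
alternating = [t % 2 == 0 for t in range(64)]   # neighbors disagree
sorted_first = [t < 32 for t in range(64)]      # neighbors agree within each group

print(total_cost(alternating, 32, 10, 10), total_cost(sorted_first, 32, 10, 10))
```

Under these assumptions, reordering the data so that neighboring threads take the same branch halves the cost, even though the same number of threads executes each side.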

Arithmetic Intensity: Executing arithmetic instructions is generally much 
faster than accessing memory on GPU hardware and thus increasing the number 
of arithmetic operations per memory access can help to hide memory latencies. 
This is not always possible, and some algorithms will remain bandwidth-limited 
rather than instruction-limited. However, this is a case where a more drastic 
re-think of a problem may be required for an efficient solution. For example, the 
optimal order of a numerical expansion may be different on a GPU architecture 
than on a CPU architecture. 
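
Whether a kernel is bandwidth-limited or instruction-limited can be estimated by comparing the time spent on arithmetic with the time spent moving data. The sketch below uses the O(100GB/s) memory bandwidth quoted above; the peak arithmetic rate of 10^12 operations per second and the function name are assumptions made purely for illustration.

```python
def kernel_time(n_flops, n_bytes, peak_flops=1.0e12, bandwidth=1.0e11):
    """Estimate run time and the limiting resource for a kernel.

    peak_flops is an assumed arithmetic rate (operations per second);
    bandwidth is the main-memory bandwidth in bytes per second.
    """
    compute_time = n_flops / peak_flops
    memory_time = n_bytes / bandwidth
    limit = "bandwidth" if memory_time > compute_time else "instruction"
    return max(compute_time, memory_time), limit

# Element-wise addition of two arrays of 10^6 single-precision values:
# one operation and 12 bytes of traffic (two reads, one write) per element.
print(kernel_time(n_flops=1.0e6, n_bytes=1.2e7))
```

With these assumed rates the balance point is 10 operations per byte; kernels below that ratio gain nothing from faster arithmetic until their memory traffic is reduced.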

Host-Device Memory Transfers: GPUs and their host machines (typically) 
have distinct memory spaces, meaning they must communicate via the 
PCI-Express bus, which exhibits relatively low bandwidth (currently ~5GB/s). 
Transferring data to and from a GPU device can therefore be a significant per- 
formance bottle-neck in some situations. 
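
A back-of-the-envelope estimate shows how transfers can erode a large kernel speed-up. The sketch below uses the ~5GB/s PCI-Express figure quoted above and assumes, for simplicity, that the same volume of data is copied to the device and back; the function name and the example timings are hypothetical.

```python
def offload_pays_off(n_bytes, cpu_time, gpu_time, pcie_bandwidth=5.0e9):
    """Return (worthwhile, total_gpu_time) once host-device copies are included.

    Assumes n_bytes are copied to the device and n_bytes are copied back.
    """
    transfer_time = 2 * n_bytes / pcie_bandwidth
    total = gpu_time + transfer_time
    return total < cpu_time, total

# 1 GB of data, a 1 s CPU computation and a kernel that is 100x faster.
worthwhile, total = offload_pays_off(n_bytes=1.0e9, cpu_time=1.0, gpu_time=0.01)
print(worthwhile, total)
```

In this example the kernel itself is 100 times faster, yet the end-to-end gain is closer to a factor of 2.4 because 0.4 s is spent on the bus. Keeping data resident on the device across several processing steps avoids paying this cost repeatedly.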



3. Algorithm Classification 

Here we present an initial classification, based on application of the "rules of 
thumb" and reduction to known GPU-efficient algorithms, of a selection of important 
astronomy problems. High efficiency algorithms correspond to expected 
O(100x) speed-ups, while moderate efficiency algorithms are expected to exhibit 
speed-ups of O(10x) over traditional CPU implementations: 



Field            High efficiency                         Moderate efficiency

Simulation       • Direct N-body                         • Tree-code N-body and SPH
                 • Fixed-resolution mesh simulations     • Halo-finding
                 • Semi-analytic modelling               • Adaptive mesh refinement
                 • Gravitational lensing ray-shooting
                 • Other Monte-Carlo methods

Data reduction   • Radio-telescope signal correlation    • Pulsar signal processing
                 • General image processing              • Stacking/mosaicing
                 • Flat-fielding etc.                    • CLEAN algorithm
                 • Source extraction                     • Gridding visibilities and
                 • Convolution and deconvolution           single-dish data

Data analysis    • Machine learning                      • Selection via criteria-matching
                 • Fitting/optimisation
                 • Numerical integration
                 • Volume rendering



4. Conclusion 

Modern astronomy relies heavily on HPC, and GPUs can provide both significant 
speed-ups over current CPUs and a glimpse of the probable future of commodity 
computing architectures. However, their more complex design means 
algorithms must be considered carefully if they are to run efficiently on these 
advanced architectures. There is therefore strong motivation to thoroughly ana- 
lyze and categorize the algorithms of astronomy in order to take full advantage of 
current and future advanced computing architectures and maximize our science 
outcomes. 

Our preliminary analysis of a broad selection of important astronomy prob- 
lems leads us to conclude that the data-rich nature of computational astronomy, 
combined with the efficiency of data-parallel algorithms on current GPU hardware, 
makes for a very promising relationship with current and future massively- 
parallel architectures. Processors are likely to become even more flexible in the 
future, potentially improving the efficiency of many astronomy algorithms and 
opening up new avenues to significant speed-ups. 